{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "917b5345-2958-444f-a81f-6c3ae328912e",
   "metadata": {},
   "source": [
    "# Knowledge Distillation For Fine-Tuning A GPT-3.5 Judge (Correctness)\n",
    "\n",
    "This notebook shows how to fine-tune an LLM judge that evaluates another LLM's responses to a user query. Specifically, we demonstrate how to use the `llama_index` library to distill knowledge from a GPT-4 judge into a GPT-3.5 judge. To do so, we take the following steps:\n",
    "\n",
    "1. Generate datasets: `train` and `test`\n",
    "2. Perform knowledge distillation (using `train`)\n",
    "3. Evaluate the distilled model on `test`\n",
    "\n",
    "Throughout, we use `CorrectnessEvaluator` as our LLM judge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2acfbbe5",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-readers-wikipedia\n",
    "%pip install llama-index-finetuning\n",
    "%pip install llama-index-llms-openai\n",
    "%pip install llama-index-finetuning-callbacks\n",
    "%pip install llama-index-llms-huggingface"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ecb4e18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# NOTE: this notebook makes several API calls to generate text with OpenAI GPT\n",
    "# models as well as models hosted on HuggingFace. If you prefer not to wait for\n",
    "# these generations, then the data for this notebook can be obtained with the\n",
    "# `wget` command provided below.\n",
    "\n",
    "# !wget \"https://www.dropbox.com/scl/fo/3kkm8v6qvhxnu449xwp3d/h?rlkey=fxom1yixru1nags9mmao1hkg2&dl=1\" -O correctness.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b2f8b72-3337-4a2a-929c-87083bbf7fc2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "065f7df9-a048-43aa-b862-290e832ea631",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# we will be using models on HuggingFace as our LLM answer generators\n",
    "HUGGING_FACE_TOKEN = os.getenv(\"HUGGING_FACE_TOKEN\")\n",
    "\n",
    "# we will use GPT-4 and GPT-3.5 + OpenAI Fine-Tuning\n",
    "OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c89edf1-b359-4370-8b1e-fad279508c68",
   "metadata": {},
   "source": [
    "## Step 1 Generate datasets: `train_dataset` and `test_dataset`\n",
    "\n",
    "As source material for generating questions and prompting an LLM to answer them, we use the `WikipediaReader` to load the \"History of <city>\" page for each of several cities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ebbecc4-543d-41d8-ae96-5f729e7e8a8f",
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "!pip install wikipedia -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff858baa-5ae1-4dda-8407-6d23237b6b9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# wikipedia pages\n",
    "from llama_index.readers.wikipedia import WikipediaReader\n",
    "\n",
    "cities = [\n",
    "    \"San Francisco\",\n",
    "    \"Toronto\",\n",
    "    \"New York\",\n",
    "    \"Vancouver\",\n",
    "    \"Montreal\",\n",
    "    \"Tokyo\",\n",
    "    \"Singapore\",\n",
    "    \"Paris\",\n",
    "]\n",
    "\n",
    "documents = WikipediaReader().load_data(\n",
    "    pages=[f\"History of {x}\" for x in cities]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c66486ab-38cf-4ed6-bef4-6fe9deee0590",
   "metadata": {},
   "source": [
    "### Use a `DatasetGenerator` to build `train_dataset` and `test_dataset`\n",
    "\n",
    "Now that we have our `Document`s, the next step is to generate the questions from which `train_dataset` and `test_dataset` will be built. For this we use the `DatasetGenerator`, which uses an LLM to generate questions from a given set of documents."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cef531ed-8e97-4d8f-8cc3-6f7e0c6ca141",
   "metadata": {},
   "source": [
    "#### Generate Questions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99afd212-38b0-492c-91ea-a810e126ad2d",
   "metadata": {},
   "outputs": [],
   "source": [
    "QUESTION_GEN_PROMPT = (\n",
    "    \"You are a Teacher/ Professor. Your task is to setup \"\n",
    "    \"a quiz/examination. Using the provided context, formulate \"\n",
    "    \"a single question that captures an important fact from the \"\n",
    "    \"context. Restrict the question to the context information provided.\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e667f6ed-145b-41f3-92d2-be36aa1e47e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate questions against chunks\n",
    "from llama_index.core.evaluation import DatasetGenerator\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "# set context for llm provider\n",
    "gpt_35_llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.3)\n",
    "\n",
    "# instantiate a DatasetGenerator\n",
    "dataset_generator = DatasetGenerator.from_documents(\n",
    "    documents,\n",
    "    question_gen_query=QUESTION_GEN_PROMPT,\n",
    "    llm=gpt_35_llm,\n",
    "    num_questions_per_chunk=25,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05ce06f6-1e85-4694-a4ac-9b4d465ed9e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "qrd = dataset_generator.generate_dataset_from_nodes(num=350)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "512e2f82-29b7-41cc-8f23-ae254f873031",
   "metadata": {},
   "outputs": [],
   "source": [
    "# If you want to save it for future use\n",
    "# qrd.save_json(\"qrd.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b201d9cb-4746-4c71-8728-55e56cb8b76f",
   "metadata": {},
   "source": [
    "#### Generate Answers To The Questions\n",
    "\n",
    "The next step is to generate answers using an LLM. Remember that the point is to judge these generated answers, so later on we will use GPT models to evaluate them.\n",
    "\n",
    "To generate the answers, we use another LLM, namely Llama-2. To do this, we first create a vector store index over our documents and an associated retriever, which the LLM answer generator will use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60f08da0-27c0-4805-8c99-fad7041b3b16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.core.retrievers import VectorIndexRetriever\n",
    "\n",
    "# Create vector index\n",
    "the_index = VectorStoreIndex.from_documents(documents=documents)\n",
    "\n",
    "# Create the retriever on this index\n",
    "the_retriever = VectorIndexRetriever(\n",
    "    index=the_index,\n",
    "    similarity_top_k=2,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65125845-b96c-4f84-b45c-7ddaf6910d92",
   "metadata": {},
   "source": [
    "From here we build a `RetrieverQueryEngine` that takes in our queries (i.e., questions) for processing. Note that we use `HuggingFaceInferenceAPI` for our LLM answer generator, and that Llama-2 requires permissions. If you haven't yet gained access to this model, feel free to swap out Llama-2 for another model of your choosing.\n",
    "\n",
    "At this point we split the generated questions into two sets: one for building `train_dataset` now, and another for `test_dataset`, which we build later in Step 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb1d6f7f-7067-48a8-ae8f-5619dc18d5d6",
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "from llama_index.core.query_engine import RetrieverQueryEngine\n",
    "from llama_index.llms.huggingface import HuggingFaceInferenceAPI\n",
    "\n",
    "llm = HuggingFaceInferenceAPI(\n",
    "    model_name=\"meta-llama/Llama-2-7b-chat-hf\",\n",
    "    context_window=2048,  # to use refine\n",
    "    token=HUGGING_FACE_TOKEN,\n",
    ")\n",
    "\n",
    "query_engine = RetrieverQueryEngine.from_args(retriever=the_retriever, llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb345deb-d34e-4b48-9f72-ef108ad3afbc",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [08:30<00:00,  6.46s/it]\n"
     ]
    }
   ],
   "source": [
    "import tqdm\n",
    "\n",
    "# we will use 65% of the generated questions for training\n",
    "train_dataset = []\n",
    "num_train_questions = int(0.65 * len(qrd.qr_pairs))\n",
    "\n",
    "for q, a in tqdm.tqdm(qrd.qr_pairs[:num_train_questions]):\n",
    "    # data for this q\n",
    "    data_entry = {\"question\": q, \"reference\": a}\n",
    "    response = query_engine.query(q)\n",
    "    response_struct = {}\n",
    "    response_struct[\"model\"] = \"llama-2\"\n",
    "    response_struct[\"text\"] = str(response)\n",
    "    response_struct[\"context\"] = (\n",
    "        response.source_nodes[0].node.text[:1000] + \"...\"\n",
    "    )\n",
    "\n",
    "    data_entry[\"response_data\"] = response_struct\n",
    "    train_dataset.append(data_entry)"
   ]
  },
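  {
   "cell_type": "markdown",
   "id": "7d1c0e52-3f4a-4b6e-9a21-0c5e8f1b2a01",
   "metadata": {},
   "source": [
    "The train/test split used above is a positional slice: the first 65% of the generated pairs go to training and the remainder is held out. Below is a minimal sketch of that arithmetic. `toy_pairs` is a hypothetical stand-in for `qrd.qr_pairs`, and its size of 123 is an assumption, chosen because it reproduces the 79 training and 44 test examples shown in this notebook's progress bars.\n",
    "\n",
    "```python\n",
    "# Hypothetical stand-in for qrd.qr_pairs: a list of (question, reference) tuples\n",
    "toy_pairs = [(f'question {i}', f'reference {i}') for i in range(123)]\n",
    "\n",
    "# Same arithmetic as the cell above: int() truncates, so the test set\n",
    "# receives any leftover example\n",
    "num_train = int(0.65 * len(toy_pairs))\n",
    "train_pairs = toy_pairs[:num_train]\n",
    "test_pairs = toy_pairs[num_train:]\n",
    "\n",
    "print(len(train_pairs), len(test_pairs))\n",
    "```\n",
    "\n",
    "Because the slice is positional rather than shuffled, questions generated from the same document may end up on the same side of the split. Shuffling the pairs before slicing is one alternative if that matters for your data."
   ]
  },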
  {
   "cell_type": "markdown",
   "id": "6e653515-75c4-4987-87e4-c0a0b17a0bdf",
   "metadata": {},
   "source": [
    "### Get GPT-4 Evaluations On The Llama-2 Answers\n",
    "\n",
    "As mentioned before, the point of this guide is to fine-tune an LLM judge from a GPT-4 judge. So, to complete our `train_dataset`, we now need to instantiate our GPT-4 judge and have it evaluate the answers that were provided by Llama-2. To do this, we use the `CorrectnessEvaluator` class. This judge compares each answer to a reference answer and provides a score between 1 and 5 (higher is better) indicating how closely the provided answer aligns with the reference.\n",
    "\n",
    "Note also that we use the `OpenAIFineTuningHandler`, which collects all the chat histories that we will eventually need to fine-tune GPT-3.5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17f8bbcf-6bc6-4a6c-a778-d43c6896fe89",
   "metadata": {},
   "outputs": [],
   "source": [
    "# instantiate the gpt-4 judge\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.finetuning.callbacks import OpenAIFineTuningHandler\n",
    "from llama_index.core.callbacks import CallbackManager\n",
    "from llama_index.core.evaluation import CorrectnessEvaluator\n",
    "\n",
    "finetuning_handler = OpenAIFineTuningHandler()\n",
    "callback_manager = CallbackManager([finetuning_handler])\n",
    "gpt_4_llm = OpenAI(\n",
    "    temperature=0, model=\"gpt-4\", callback_manager=callback_manager\n",
    ")\n",
    "\n",
    "gpt4_judge = CorrectnessEvaluator(llm=gpt_4_llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9225358e-e898-4f32-a99b-47a67941f715",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [12:31<00:00,  9.51s/it]\n"
     ]
    }
   ],
   "source": [
    "import tqdm\n",
    "\n",
    "# for `training`\n",
    "for data_entry in tqdm.tqdm(train_dataset):\n",
    "    eval_result = await gpt4_judge.aevaluate(\n",
    "        query=data_entry[\"question\"],\n",
    "        response=data_entry[\"response_data\"][\"text\"],\n",
    "        context=data_entry[\"response_data\"][\"context\"],\n",
    "        reference=data_entry[\"reference\"],\n",
    "    )\n",
    "\n",
    "    # save final result\n",
    "    judgement = {}\n",
    "    judgement[\"llm\"] = \"gpt_4\"\n",
    "    judgement[\"score\"] = eval_result.score\n",
    "    judgement[\"text\"] = eval_result.response\n",
    "    data_entry[\"evaluations\"] = [judgement]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f56ed14f-2209-4fbb-86cf-2fa69aa1fdc1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote 79 examples to correction_finetuning_events.jsonl\n"
     ]
    }
   ],
   "source": [
    "finetuning_handler.save_finetuning_events(\"correction_finetuning_events.jsonl\")"
   ]
  },
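  {
   "cell_type": "markdown",
   "id": "2b9e4c17-8a3d-4f60-b5c2-6e1d7a9f3c02",
   "metadata": {},
   "source": [
    "Before launching a fine-tuning job, it can help to inspect the saved events. The sketch below assumes each line of the `.jsonl` file is one JSON object with a `messages` list of system, user, and assistant turns, and that the judge's reply puts the score on its first line, as the evaluator prompt requests. The record contents and the file name `toy_events.jsonl` are hypothetical; the sketch writes its own toy file rather than reading the real one.\n",
    "\n",
    "```python\n",
    "import json\n",
    "import os\n",
    "import tempfile\n",
    "\n",
    "# Hypothetical record in the assumed chat format (system / user / assistant)\n",
    "record = {\n",
    "    'messages': [\n",
    "        {'role': 'system', 'content': 'You are an expert evaluation system.'},\n",
    "        {'role': 'user', 'content': 'User query, reference answer, generated answer'},\n",
    "        {'role': 'assistant', 'content': '4.0\\nRelevant and correct.'},\n",
    "    ]\n",
    "}\n",
    "\n",
    "# Write one record per line, which is what the JSON Lines format means\n",
    "path = os.path.join(tempfile.mkdtemp(), 'toy_events.jsonl')\n",
    "with open(path, 'w') as f:\n",
    "    f.write(json.dumps(record) + '\\n')\n",
    "\n",
    "# Read the file back, skipping blank lines\n",
    "with open(path) as f:\n",
    "    events = [json.loads(line) for line in f if line.strip()]\n",
    "\n",
    "# The last message is the judge's reply: score first, reasoning after\n",
    "roles = [m['role'] for m in events[0]['messages']]\n",
    "score_line, reasoning = events[0]['messages'][-1]['content'].split('\\n', 1)\n",
    "score = float(score_line)\n",
    "\n",
    "print(roles, score, reasoning)\n",
    "```\n",
    "\n",
    "Running the same read loop over the real events file would give a quick view of the score distribution that the fine-tuned judge will learn from."
   ]
  },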
  {
   "cell_type": "markdown",
   "id": "11eb16d0-15b1-45f3-96b0-3b51574d1626",
   "metadata": {},
   "source": [
    "## Step 2 Perform knowledge distillation\n",
    "\n",
    "Okay, it's now time to distill some knowledge from GPT-4 to GPT-3.5. To do this, we will make use of the `OpenAIFinetuneEngine` class as well as the `correction_finetuning_events.jsonl` file that we just created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb90eb9e-aad3-4bce-a9b5-c2343c4e3d66",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.finetuning import OpenAIFinetuneEngine\n",
    "\n",
    "finetune_engine = OpenAIFinetuneEngine(\n",
    "    \"gpt-3.5-turbo\",\n",
    "    \"correction_finetuning_events.jsonl\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32b6b392-8932-4eda-93eb-911513be3cfb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Num examples: 79\n",
      "First example:\n",
      "{'role': 'system', 'content': '\\nYou are an expert evaluation system for a question answering chatbot.\\n\\nYou are given the following information:\\n- a user query,\\n- a reference answer, and\\n- a generated answer.\\n\\nYour job is to judge the relevance and correctness of the generated answer.\\nOutput a single score that represents a holistic evaluation.\\nYou must return your response in a line with only the score.\\nDo not return answers in any other format.\\nOn a separate line provide your reasoning for the score as well.\\n\\nFollow these guidelines for scoring:\\n- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.\\n- If the generated answer is not relevant to the user query, you should give a score of 1.\\n- If the generated answer is relevant but contains mistakes, you should give a score between 2 and 3.\\n- If the generated answer is relevant and fully correct, you should give a score between 4 and 5.\\n\\nExample Response:\\n4.0\\nThe generated answer has the exact same metrics as the reference answer,     but it is not as concise.\\n\\n'}\n",
      "{'role': 'user', 'content': '\\n## User Query\\nWhat event in 1906 caused significant damage to San Francisco but was followed by a quick rebuild?\\n\\n## Reference Answer\\nThe great earthquake and fire in 1906 caused significant damage to San Francisco but was followed by a quick rebuild.\\n\\n## Generated Answer\\n1906 earthquake and fire.\\n'}\n",
      "{'role': 'assistant', 'content': '4.0\\nThe generated answer is relevant and correct, but it lacks the detail and context provided in the reference answer.'}\n",
      "No errors found\n",
      "Num examples missing system message: 0\n",
      "Num examples missing user message: 0\n",
      "\n",
      "#### Distribution of num_messages_per_example:\n",
      "min / max: 3, 3\n",
      "mean / median: 3.0, 3.0\n",
      "p5 / p95: 3.0, 3.0\n",
      "\n",
      "#### Distribution of num_total_tokens_per_example:\n",
      "min / max: 315, 782\n",
      "mean / median: 479.49367088607596, 465.0\n",
      "p5 / p95: 355.6, 634.6\n",
      "\n",
      "#### Distribution of num_assistant_tokens_per_example:\n",
      "min / max: 19, 110\n",
      "mean / median: 57.63291139240506, 56.0\n",
      "p5 / p95: 29.6, 83.2\n",
      "\n",
      "0 examples may be over the 4096 token limit, they will be truncated during fine-tuning\n",
      "Dataset has ~37880 tokens that will be charged for during training\n",
      "By default, you'll train for 3 epochs on this dataset\n",
      "By default, you'll be charged for ~113640 tokens\n",
      "As of August 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens.\n",
      "This means your total cost for training will be $0.30304000000000003 per epoch.\n"
     ]
    }
   ],
   "source": [
    "# Launch the fine-tuning job\n",
    "# This may take some time ...\n",
    "finetune_engine.finetune()"
   ]
  },
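  {
   "cell_type": "markdown",
   "id": "5e8a1d39-c7b2-4a04-9f16-3d2b6c0e7a03",
   "metadata": {},
   "source": [
    "The cost figures in the output above follow from simple arithmetic. The sketch below reproduces them using only the numbers printed there. The price is the one quoted in the output as of August 22, 2023 and may have changed since. Note that the printed dollar amount is labelled per epoch, so the three-epoch total is about three times that.\n",
    "\n",
    "```python\n",
    "# Figures copied from the validation output above\n",
    "tokens_per_epoch = 37880\n",
    "n_epochs = 3\n",
    "price_per_1k_tokens = 0.008  # USD, as quoted in the output (August 22, 2023)\n",
    "\n",
    "# Tokens billed across all epochs\n",
    "billed_tokens = tokens_per_epoch * n_epochs\n",
    "\n",
    "# Cost for one pass over the data, then for the whole job\n",
    "cost_per_epoch = tokens_per_epoch / 1000 * price_per_1k_tokens\n",
    "total_cost = cost_per_epoch * n_epochs\n",
    "\n",
    "print(billed_tokens)\n",
    "print(round(cost_per_epoch, 2), round(total_cost, 2))\n",
    "```"
   ]
  },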
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32d0234b-040d-44f6-901b-38e6eb241644",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<FineTuningJob fine_tuning.job id=ftjob-9y8G7rzbCkzPjsKtPMsfwRSu at 0x1778d6a70> JSON: {\n",
       "  \"object\": \"fine_tuning.job\",\n",
       "  \"id\": \"ftjob-9y8G7rzbCkzPjsKtPMsfwRSu\",\n",
       "  \"model\": \"gpt-3.5-turbo-0613\",\n",
       "  \"created_at\": 1698851177,\n",
       "  \"finished_at\": 1698851823,\n",
       "  \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:llamaindex::8G7FovVj\",\n",
       "  \"organization_id\": \"org-1ZDAvajC6v2ZtAP9hLEIsXRz\",\n",
       "  \"result_files\": [\n",
       "    \"file-bx2ObrpVPq7Q2pmv743W1eFQ\"\n",
       "  ],\n",
       "  \"status\": \"succeeded\",\n",
       "  \"validation_file\": null,\n",
       "  \"training_file\": \"file-xAwZ2NSzbck3p8u24kznzySX\",\n",
       "  \"hyperparameters\": {\n",
       "    \"n_epochs\": 3\n",
       "  },\n",
       "  \"trained_tokens\": 113166,\n",
       "  \"error\": null\n",
       "}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We can check the status of our current job as follows\n",
    "finetune_engine.get_current_job()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e631ccf-b4f5-478d-b112-52dc88ffed1e",
   "metadata": {},
   "source": [
    "## Step 3 Evaluate The Fine-Tuned GPT-3.5 Judge On The Test Dataset\n",
    "\n",
    "Now that we have our fine-tuned GPT-3.5, let's see how well it performs on a test set. Recall that we held off on creating the `test_dataset` until we needed it. That time is now, so we repeat the process used for `train_dataset`, this time for `test_dataset`.\n",
    "\n",
    "NOTE: generating these answers and evaluations will take some time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e6db08a-da63-4f0f-b148-16ed089ca585",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [05:07<00:00,  6.99s/it]\n"
     ]
    }
   ],
   "source": [
    "# Use Llama-2 to generate answers to the test questions\n",
    "test_dataset = []\n",
    "for q, a in tqdm.tqdm(qrd.qr_pairs[num_train_questions:]):\n",
    "    # data for this q\n",
    "    data_entry = {\"question\": q, \"reference\": a}\n",
    "    response = query_engine.query(q)\n",
    "    response_struct = {}\n",
    "    response_struct[\"model\"] = \"llama-2\"\n",
    "    response_struct[\"text\"] = str(response)\n",
    "    response_struct[\"context\"] = (\n",
    "        response.source_nodes[0].node.text[:1000] + \"...\"\n",
    "    )\n",
    "\n",
    "    data_entry[\"response_data\"] = response_struct\n",
    "    test_dataset.append(data_entry)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b70e535a-bd3a-4664-943a-1bf3ff02a8da",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [06:52<00:00,  9.37s/it]\n"
     ]
    }
   ],
   "source": [
    "# get the gpt-4 judgements on the Llama-2 answers\n",
    "for data_entry in tqdm.tqdm(test_dataset):\n",
    "    eval_result = await gpt4_judge.aevaluate(\n",
    "        query=data_entry[\"question\"],\n",
    "        response=data_entry[\"response_data\"][\"text\"],\n",
    "        context=data_entry[\"response_data\"][\"context\"],\n",
    "        reference=data_entry[\"reference\"],\n",
    "    )\n",
    "\n",
    "    # save final result\n",
    "    judgement = {}\n",
    "    judgement[\"llm\"] = \"gpt_4\"\n",
    "    judgement[\"score\"] = eval_result.score\n",
    "    judgement[\"text\"] = eval_result.response\n",
    "    data_entry[\"evaluations\"] = [judgement]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "532d2661-3d06-42c6-9a99-86616ca4a31e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:44<00:00,  1.02s/it]\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core.evaluation import EvaluationResult\n",
    "\n",
    "# use our fine-tuned GPT-3.5 to evaluate the answers\n",
    "ft_llm = finetune_engine.get_finetuned_model()\n",
    "\n",
    "ft_gpt_3p5_judge = CorrectnessEvaluator(llm=ft_llm)\n",
    "\n",
    "for data_entry in tqdm.tqdm(test_dataset):\n",
    "    eval_result = await ft_gpt_3p5_judge.aevaluate(\n",
    "        query=data_entry[\"question\"],\n",
    "        response=data_entry[\"response_data\"][\"text\"],\n",
    "        context=data_entry[\"response_data\"][\"context\"],\n",
    "        reference=data_entry[\"reference\"],\n",
    "    )\n",
    "\n",
    "    # save final result\n",
    "    judgement = {}\n",
    "    judgement[\"llm\"] = \"ft_gpt_3p5\"\n",
    "    judgement[\"score\"] = eval_result.score\n",
    "    judgement[\"text\"] = eval_result.response\n",
    "    data_entry[\"evaluations\"] += [judgement]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ee6616b-ea0a-44c4-8749-68e64e0ebbee",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [01:36<00:00,  2.19s/it]\n"
     ]
    }
   ],
   "source": [
    "# Similarly, use a non-fine-tuned judge to evaluate the answers\n",
    "gpt_3p5_llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
    "\n",
    "gpt_3p5_judge = CorrectnessEvaluator(llm=gpt_3p5_llm)\n",
    "\n",
    "for data_entry in tqdm.tqdm(test_dataset):\n",
    "    eval_result = await gpt_3p5_judge.aevaluate(\n",
    "        query=data_entry[\"question\"],\n",
    "        response=data_entry[\"response_data\"][\"text\"],\n",
    "        context=data_entry[\"response_data\"][\"context\"],\n",
    "        reference=data_entry[\"reference\"],\n",
    "    )\n",
    "\n",
    "    # save final result\n",
    "    judgement = {}\n",
    "    judgement[\"llm\"] = \"gpt_3p5\"\n",
    "    judgement[\"score\"] = eval_result.score\n",
    "    judgement[\"text\"] = eval_result.response\n",
    "    data_entry[\"evaluations\"] += [judgement]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35120c06-3218-4317-ad40-b94df1d6ca14",
   "metadata": {},
   "source": [
    "### The Metrics\n",
    "\n",
    "Phew! We have now generated all of the LLM judges' evaluations of the Llama-2 answers on the test queries. Let's get a quantitative view of how close the fine-tuned GPT-3.5 is to GPT-4.\n",
    "\n",
    "For this, we report the correlation between the scores of the fine-tuned (and the non-fine-tuned) GPT-3.5 judge and those of the GPT-4 judge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ded5f971-488a-469b-b240-bf2b15d530f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "REPORT_FMT_STR = (\n",
    "    \"{model}\\n\"\n",
    "    \"-----------------\\n\"\n",
    "    \"Number of obs.: {total_obs}\\n\"\n",
    "    \"Correlation with GPT-4: {corr}\\n\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d612ee7b-f065-49fe-80ad-344fda9ff6b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# collect each judge's scores, keyed by judge name\n",
    "scores = {\"gpt_4\": [], \"gpt_3p5\": [], \"ft_gpt_3p5\": []}\n",
    "for d in test_dataset:\n",
    "    for e in d[\"evaluations\"]:\n",
    "        scores[e[\"llm\"]].append(e[\"score\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61637486-4df2-419f-9973-d449e18efcea",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GPT-3.5 w/ fine-tuning\n",
      "-----------------\n",
      "Number of obs.: 44\n",
      "Correlation with GPT-4: 0.9279850303778618\n",
      "\n",
      "\n",
      "\n",
      "GPT-3.5 w/out fine-tuning\n",
      "-----------------\n",
      "Number of obs.: 44\n",
      "Correlation with GPT-4: 0.8737418723878325\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# numpy conversion\n",
    "np_scores_gpt_4 = np.array(scores[\"gpt_4\"])\n",
    "np_scores_gpt_3p5 = np.array(scores[\"gpt_3p5\"])\n",
    "np_scores_ft_gpt_3p5 = np.array(scores[\"ft_gpt_3p5\"])\n",
    "\n",
    "# correlations\n",
    "corr_ft = np.corrcoef(np_scores_gpt_4, np_scores_ft_gpt_3p5)[0, 1]\n",
    "corr_no_ft = np.corrcoef(np_scores_gpt_4, np_scores_gpt_3p5)[0, 1]\n",
    "\n",
    "print(\n",
    "    REPORT_FMT_STR.format(\n",
    "        model=\"GPT-3.5 w/ fine-tuning\",\n",
    "        total_obs=np_scores_gpt_4.shape[0],\n",
    "        corr=corr_ft,\n",
    "    )\n",
    ")\n",
    "print(\"\\n\")\n",
    "print(\n",
    "    REPORT_FMT_STR.format(\n",
    "        model=\"GPT-3.5 w/out fine-tuning\",\n",
    "        total_obs=np_scores_gpt_4.shape[0],\n",
    "        corr=corr_no_ft,\n",
    "    )\n",
    ")"
   ]
  },
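  {
   "cell_type": "markdown",
   "id": "9c3f7b28-1d6e-4e75-a8b0-4f2a5d1c6e04",
   "metadata": {},
   "source": [
    "The value read from `np.corrcoef(a, b)[0, 1]` is the Pearson correlation coefficient between the two score lists. As a sketch of what that number measures, the function below computes the same quantity in plain Python. The helper name `pearson` and the toy scores are hypothetical and are not results from this notebook.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def pearson(xs, ys):\n",
    "    # Covariance of the two lists divided by the product of their spreads\n",
    "    n = len(xs)\n",
    "    mean_x = sum(xs) / n\n",
    "    mean_y = sum(ys) / n\n",
    "    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))\n",
    "    var_x = sum((x - mean_x) ** 2 for x in xs)\n",
    "    var_y = sum((y - mean_y) ** 2 for y in ys)\n",
    "    return cov / math.sqrt(var_x * var_y)\n",
    "\n",
    "\n",
    "# Hypothetical judge scores on five answers\n",
    "teacher = [5.0, 4.0, 2.0, 1.0, 4.5]\n",
    "student = [4.5, 4.0, 2.5, 1.0, 5.0]\n",
    "\n",
    "print(pearson(teacher, student))\n",
    "```\n",
    "\n",
    "A value near 1 means the two judges rank and space the answers similarly. It does not by itself show that they assign the same absolute scores, since a judge that is consistently half a point lower would still correlate perfectly."
   ]
  },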
  {
   "cell_type": "markdown",
   "id": "a0dde4f3-7802-49bc-95d8-b2683879a441",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "From the above numbers we see that the fine-tuned GPT-3.5 judge correlates more strongly with GPT-4 (about 0.93) than its non-fine-tuned counterpart does (about 0.87). Thus, for this case, fine-tuning has helped us obtain a GPT-3.5 judge that is closer to a GPT-4 judge (and thus, by proxy, closer to human judgements)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_3.10",
   "language": "python",
   "name": "llama_index_3.10"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
